For this exercise we're going to learn how to use complex probability distributions to generate data. In other words, we're going to generate plausible, fake text from a body of text. To help out, we're going to need Python's NLTK and NumPy libraries.
In [115]:
import nltk, random, numpy, textwrap
In [3]:
nltk.corpus.gutenberg.fileids()
Out[3]:
NLTK has a neat corpus module with sets of test data to use. Within the files for Project Gutenberg, we've got a nice selection of classics to choose from. We'll choose "Moby Dick" because it is particularly long. Our first task is to read the data into a Python string and split it into words so we can process it.
In [7]:
training = nltk.corpus.gutenberg.open('melville-moby_dick.txt').read()
In [8]:
training_words = training.split()
Now we will take a closer look at the data.
In [9]:
training_words[0]
Out[9]:
Ah, well, that's a problem. The first words of this file look like a header. So let's move a little way down the list until we find a sentence start after the header. With a large enough body of text this wouldn't matter, but for our purposes let's make it easier on ourselves by making sure our data is higher quality. High-quality data is the first step toward better predictions, which is why in practice data scientists spend much of their time on it.
In [15]:
for idx, (w1, w2) in enumerate(zip(training_words, training_words[1:])):
    if w1 == "Call" and w2 == "me":
        print("Ishmael", idx)
        break
Wow! That's pretty far down the page. Let's verify before moving forward.
In [16]:
print(training_words[3550:3650])
Okay, looks good. Now let's zoom in.
In [17]:
print(training_words[3608:3612])
Alright, 3608 is definitely our starting point. And now we see another issue too: if we're going to generate plausible text, we need to take sentences into account, not just words.
Markov chains let us model the probability that one word will be followed by another, mapping out a directed graph of state transitions between the different words in our body of text. In this way we capture something about grammar without having to encode it in our model explicitly. For more information, take a minute to read here: https://en.wikipedia.org/wiki/Markov_chain
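To make that concrete, here is a hand-written (not computed) transition table for the toy text "the cat sat. the dog sat." Each key is a state, the nested dict counts how often each word follows it, and "_START" stands for the beginning of a sentence. Dividing a row by its total gives the transition probabilities; for example, from "the" we move to "cat" or "dog" with probability 0.5 each.

# Hand-written transition counts for the toy text "the cat sat. the dog sat."
toy_chain = {
    "_START": {"the": 2},            # both sentences start with "the"
    "the":    {"cat": 1, "dog": 1},  # "the" is followed by "cat" half the time, "dog" the other half
    "cat":    {"sat.": 1},
    "dog":    {"sat.": 1},
}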
In [129]:
def build_model(words):
    last_word = "_START"  # Initialize from a starting point
    model = {"_START": {}}
    for word in words:
        try:
            prev = model[last_word]  # transitions recorded so far for the previous word
        except KeyError:
            model[last_word] = {word: 1}  # first transition seen for the previous word
        else:
            if word in prev:
                prev[word] += 1
            else:
                prev[word] = 1
        # A sentence-ending word sends us back to the start state
        if word.endswith('.') or word.endswith('!'):
            last_word = "_START"
        else:
            last_word = word
    return model
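Before pointing it at the novel, a quick sanity check on a made-up pair of sentences (the input here is purely illustrative) confirms that counts accumulate and that a period resets us to "_START":

build_model("I saw a whale. I saw a boat.".split())
# {'_START': {'I': 2}, 'I': {'saw': 2}, 'saw': {'a': 2}, 'a': {'whale.': 1, 'boat.': 1}}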
Okay, now that we've created a function to build our model, let's try it out and see what we get.
In [130]:
mk = build_model(training_words[3608:])
print(len(mk))
It looks like we have a lot of top-level keys. This means there are 28558 states for which we've recorded transition frequencies. Let's check out our top starting words.
In [132]:
sorted(mk["_START"].items(), key=lambda kv: -kv[1])[:10]
Out[132]:
If they're all capitalized like this, we're doing well. Now let's see if we can do something a little more complicated: let's find the top-ranked pairs of words using a nested comprehension.
In [48]:
pairs = sorted((-freq, (w1,w2)) for w1, follows in mk.items() for w2, freq in follows.items() if w1 != "_START")
In [49]:
for freq, (w1, w2) in pairs[:50]:
    print("{}: {} {}".format(-freq, w1, w2))
Now we know the top pairs, but we need to make weighted random transitions, not just list the top pairs. With a little help from numpy we can generate a cumulative sum of frequencies, transforming [('The', 1), ('if', 7), ('as', 3)] into [('The', 1), ('if', 8), ('as', 11)], which lets us make a weighted random choice for the next word given a particular prior word.
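Here's that transformation on its own, using the example counts above, so you can see exactly what numpy.cumsum produces:

freqs = [('The', 1), ('if', 7), ('as', 3)]
totals = numpy.cumsum([f for w, f in freqs])       # array([ 1,  8, 11])
[(w, int(t)) for (w, f), t in zip(freqs, totals)]  # [('The', 1), ('if', 8), ('as', 11)]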
In [99]:
def random_choice(choices):
    ch_list = list(sorted(choices.items(), key=lambda kv: kv[1]))  # [('w1', 1), ('w2', 3), ... ]
    cs = list(numpy.cumsum([v for k, v in ch_list]))
    cum_choices = [(cf, w) for cf, (w, f) in zip(cs, ch_list)]  # [(1, 'w1'), (4, 'w2'), ... ]
    choice = random.randint(1, cum_choices[-1][0])  # pick a point in the total frequency range
    for cf, value in cum_choices:
        if choice <= cf:
            return value
    else:
        # Only reached if the loop never returned, which shouldn't happen
        raise ValueError("Random range out of bounds")
In [108]:
x = random_choice(mk["_START"])
In [109]:
print (x, mk['_START'][x])
Now that we've got a function to make weighted random choices, all that's left is to create our generation function. In the end, this is a simple algorithm: start from "_START", repeatedly pick a weighted random next word, return to "_START" whenever a word ends a sentence, and stop once we've produced at least the minimum number of words and the last word ends a sentence.
In [119]:
def generate(min_words, weights):
    words = []
    # Keep going until we have enough words and the last one ends a sentence
    while len(words) < min_words or not words[-1].endswith("."):
        try:
            prev = "_START" if words[-1].endswith(".") or words[-1].endswith("!") else words[-1]
        except IndexError:
            prev = "_START"  # no words yet, so begin at the start state
        words.append(random_choice(weights[prev]))
    return " ".join(words)
In [122]:
print (generate(30, mk))
Nice! Let's generate a longer one now. A long run of text might wrap awkwardly in the middle of a word, so let's make use of Python's textwrap module to word-wrap our output and generate a whole page.
In [123]:
print( "\n".join(textwrap.wrap(generate(500, mk), 80)))
We've done it! With no understanding of grammar or meaning, we've learned how to generate semi-plausible text. With some help from NLTK's stemming and punctuation detection, we could build a more complex probability distribution and improve this further, but for now we'll stop here.
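As a hint of one possible next step (this is only a sketch, and it assumes the "punkt" tokenizer data has been downloaded), NLTK's word_tokenize splits punctuation into its own tokens, which would let commas and periods become states of their own instead of clinging to the words they follow:

nltk.download('punkt')  # one-time download of the tokenizer models used by word_tokenize
sample = " ".join(training_words[3608:3630])
print(nltk.word_tokenize(sample))  # punctuation comes back as separate tokens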